Abstract

Background: The 2013 BioVis Contest provided an opportunity to evaluate different paradigms for visualizing 
protein multiple sequence alignments. Such data sets are becoming extremely large and thus taxing current 
visualization paradigms. Sequence Logos represent consensus sequences but have limitations for protein 
alignments. As an alternative, ProfileGrids are a new protein sequence alignment visualization paradigm that 
represents an alignment as a color-coded matrix of the residue frequency occurring at every homologous position 
in the aligned protein family. 

Results: The JProfileGrid software program was used to analyze the BioVis contest data sets to generate figures for 
comparison with the Sequence Logo reference images. 

Conclusions: The ProfileGrid representation allows for the clear and effective analysis of protein multiple sequence 
alignments. This includes both a general overview of the conservation and diversity sequence patterns as well as 
the interactive ability to query the details of the protein residue distributions in the alignment. The JProfileGrid 
software is free and available from www.ProfileGrid.org. 



Background 

Molecular biologists can learn about a protein's structure
and function by studying the natural sequence variation
resulting from a protein family's evolution [1]. A
multiple sequence alignment (MSA) is crucial for such
bioinformatic analysis to relate homologous residues to
one another, as pioneered by early molecular evolution
studies (for example [2]). Software tools are now available
to align very large data sets consisting of 100,000+
sequences [3]. However, there is a lack of programs that
allow users to visualize and to interact with very large
MSAs [4]. Existing MSA visualization tools [5] continue
to use a simple stacked sequence alignment visualization
paradigm representing all individual sequences as rows
in a table and homologous residue positions as columns.
This "row-column" paradigm was sufficient decades ago
when alignments were small, but the stacked sequence
representation is now inadequate for the amount of data
being used today.

Sequence Logos were introduced in 1990 as a new visua- 
lization paradigm to represent the MSA consensus residues 
enabling easy observation of conservation patterns in 
MSAs [6]. A hallmark of this paradigm is that a residue's 
frequency in the MSA column is depicted visually by the 
height of the residue single-letter symbol with all residues 
from the same column then stacked upon each other. 
These features are shared by subsequent derivatives of 
the Sequence Logo paradigm (Table 1). Modifications to 
the paradigm include different algorithms for determining 
symbol height [7-16], visualizing conservation within MSA 
subpopulations [17,18], replacing the symbols with one- 
dimensional [19] or three-dimensional [20] bars, comparing 
differences between two MSAs [11], identifying infrequent 
residues [15,21], handling specialized contexts (such as for 
RNA [7,22,23], structural elements [24], or codons [25]), 
and implementing online web servers [26]. 

Table 1 Sequence Logo Derivatives

Despite their widespread use, though, Sequence Logos
have their limitations. While Logos are suitable for nucleic
acids with only four residue symbols legible at even small
sizes, there are problems when representing 20 residues of
protein sequences. Variable regions of a protein alignment
and positions with rare residues are not legible in Logos
since the symbol size is small when scaled relative to the
conserved positions. This has been described as a "totally
incomprehensible jumble of letters." [14] However, even
for conserved residues, the stacking of the symbols can
lead to confusion. For example, Schneider pointed out
that an "F" on top of an "L" could be mistaken for a long
"E." [6] Logos do not display any information about residues
missing from alignment columns and also lack a
representation for gap symbols. Notably, Logos suffer
from aesthetic challenges when two visualization channel
types [27] are combined: symbol/stack height versus symbol
color. Interpreting the tall/short stacked letter columns
can be complicated by the color schemes used to distinguish
different amino acid residue classes (such as the prominent
red coloring of acidic protein residues). Finally, a
recent user study evaluated the effectiveness of Sequence
Logos for comparing motifs [28]. The authors identified
these problems: difficulty in judging the height of stacked
symbols, no standardization for symbol color schemes,
and a lack of interactivity for most current software tools.
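
To illustrate why rare residues become illegible in a Logo, the following minimal sketch computes letter heights in outline the way the original Sequence Logo scheme does: each letter's height is its column frequency multiplied by the column's information content. It assumes a 20-letter alphabet and omits the small-sample correction; the function name `logo_heights` and the example column (921 glycines and 2 asparagines) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def logo_heights(column, alphabet_size=20):
    """Return {residue: height in bits} for one alignment column.

    Height = frequency * information content, where information content is
    log2(alphabet_size) minus the Shannon entropy of the column.
    The small-sample correction is deliberately omitted in this sketch.
    """
    counts = Counter(column)
    total = sum(counts.values())
    freqs = {res: n / total for res, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return {res: f * information for res, f in freqs.items()}

# Hypothetical column: 921 glycines and 2 asparagines out of 923 sequences.
heights = logo_heights("G" * 921 + "N" * 2)
ratio = heights["N"] / heights["G"]  # same as the frequency ratio 2/921
```

Because both letters share the same information content, the rare letter is drawn at roughly 0.2% of the height of the dominant one, which is far too small to read at normal figure sizes.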

The challenge in the visualization of large sequence
alignments is in identifying a paradigm that summarizes
the overall conservation trends while still providing easy
navigation to detailed views of the underlying data [4].
ProfileGrids as a MSA visualization paradigm were
invented in 2005 for the analysis of the bacterial RecA
protein family. MSA analysis had become impractical
using the standard stacked sequence representation
since a curated alignment had grown to several hundred
homologs [29]. A ProfileGrid reduces an alignment to a
matrix, color-shaded according to the residue frequency
in the MSA [30]. The JProfileGrid Java program was
upgraded to version 2.0 [31] with new software features
as well as improvements to the aesthetics of the visualization
paradigm after receiving feedback from the
VIZBI.org community [27]. The ProfileGrid paradigm
has two simple but significant differences that avoid
the limitations of Sequence Logos: 1) all residue symbols are
the same height since the matrix row sizes are fixed; 2)
color shading of the cells is used to represent the frequency
distribution of the residues in the sequence data.
Thus, the overall conservation trends can be seen from
the "heat map." The JProfileGrid software tool is an
interactive MSA viewer taking advantage of the matrix
representation of alignments.
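
The reduction of an alignment to a matrix can be sketched as follows. This is a minimal illustration, not the JProfileGrid implementation: the function name `profile_grid`, the alphabetical row order, and the toy sequences are assumptions made here for the example.

```python
from collections import Counter

RESIDUES = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap row

def profile_grid(alignment):
    """Count residues per column: returns {residue: [count at each position]}."""
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    grid = {res: [0] * width for res in RESIDUES}
    for col in range(width):
        counts = Counter(seq[col] for seq in alignment)
        for res, n in counts.items():
            if res in grid:  # symbols outside the 21 rows are ignored here
                grid[res][col] = n
    return grid

toy_msa = ["GRT-", "GRTY", "NATY", "GATY"]  # made-up sequences, not AKL data
grid = profile_grid(toy_msa)
```

The grid always has 21 rows regardless of how many sequences are aligned, so adding sequences changes only the counts and never the size of the matrix.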

The 2013 BioVis conference Redesign Contest pro- 
vided an opportunity to demonstrate the usefulness of 
the ProfileGrid paradigm by visualizing the adenylate 
kinase lid (AKL) protein family alignments [32]. Here, 
I report the final figures generated by the JProfileGrid 
software and the unique observations made possible by 
ProfileGrid analysis. 

Methods 

The protein sequence alignment data sets were provided 
by the 2013 BioVis Redesign Contest organizers. The 
Sequence Logos (Figure 1) were generated using the 
WebLogo server [26] and replicate the original BioVis 
contest figure to be "redesigned." Two versions of the 
JProfileGrid software, 1.22 [30] and 2.0.5 [31], have been 
described in previous publications. The latter version 
was used for this study and is available under a GNU 
General Public License at www.ProfileGrid.org. 

Results 

ProfileGrids clearly visualize protein residue distributions 

The 2013 BioVis Redesign Contest description "recog- 
nizes the importance of effective encodings and clear 
visual communication in display of complex quantitative 
information" and "gives participants the opportunity to 
develop a practical replacement to the long-standing 
convention of sequence logos" [32] as exemplified by 
the AKL domain Sequence Logos (Figure 1). ProfileGrids
are a practical and effective replacement for
Sequence Logos when analyzing protein alignments.
Furthermore, the mature JProfileGrid MSA software
viewer has a rich graphical user interface allowing a
molecular biologist to analyze their protein family of
interest [30].

Figure 1 Sequence Logo visualization of the AKL alignments. Sequence Logos of the adenylate kinase lid (AKL) domain multiple sequence
alignments (A) across all organisms for 1,809 protein sequences, (B) from Gram negative bacteria for 923 sequences, and (C) from Gram positive
bacteria for 886 sequences.

The AKL domain sequence alignments were imported 
into JProfileGrid for visualization and analysis to deter- 
mine the user settings for the final figures. Two visuali- 
zations were designed to demonstrate the ProfileGrid 
paradigm's capabilities as well as to represent observa- 
tions from the AKL domain MSAs. The entire align- 
ment of 1,809 sequences is shown in Figure 2A with the 
default JProfileGrid settings where the 20 protein resi- 
due rows are sorted alphabetically (with one additional 
gap symbol row). The frequency of the residues in each 
alignment is represented by color shades according to a 
blue ramp from low (white) to high (dark blue) conser- 
vation across the 36 residue width (columns) of the 
AKL sequences. The lower panels show the subpopula- 
tions for the 923 sequences from Gram negative bacteria 
(Figure 2B) and the 886 sequences from Gram positive 
bacteria (Figure 2C). The latter two panels show a single 
representative reference sequence at the top of the Pro- 
fileGrid for each respective subpopulation. 
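
The frequency-to-shade mapping can be sketched as a simple binning step. The threshold values (10, 25, 50, 70, 90%) are those shown in the Figure 2 legend; how the software treats values exactly on a boundary, and the function name `shade_level`, are assumptions made for this example.

```python
from bisect import bisect_right

THRESHOLDS = [10, 25, 50, 70, 90]  # percent; lower edges of the darker shades

def shade_level(count, n_sequences):
    """Return 0 (white, <10%) up to 5 (darkest, >=90%) for one cell."""
    percent = 100.0 * count / n_sequences
    return bisect_right(THRESHOLDS, percent)

rare = shade_level(2, 923)         # about 0.2% of the alignment
conserved = shade_level(921, 923)  # about 99.8% of the alignment
```

Normalizing by the number of sequences in each alignment is what lets panels with 1,809, 923, and 886 sequences share one color ramp.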

Overall trends of sequence conservation and variability
are quite clear from the ProfileGrids, and the user can
choose whether to show the details of the residue frequencies
(values in Figures 2B and 2C) or to show just
the residues observed (Figure 2A) at that MSA position, i.e.
each column. Note that the ProfileGrid in Figure 2A can
be directly related to the Sequence Logo (Figure 1A),
where the residues shown are stacked upon one another
in each column and the symbol height reflects the conservation
of each residue. The ProfileGrid, however, allows
each symbol to be legible, which is critical for interpreting
a variable column such as position 14. The ProfileGrid
allows the entire protein mutation distribution to be
assessed. Similarly, rare residues are obvious, such as at
position 8 (Figure 2B), where the Redesign Contest instructions
incorrectly stated that "glycine is absolutely conserved
in the Gram-negatives." [33] In fact, 2 sequences
have asparagine at that position (Theileria parva strain
Muguga [GenBank:XP_766154.1 http://www.ncbi.nlm.nih.gov/protein/71033025]
and Theileria annulata [GenBank:XP_954152.1
http://www.ncbi.nlm.nih.gov/protein/XP_954152.1]).
Such rare observations would be impossible
to identify from the Sequence Logo paradigm (Figure
1B) or its derivatives (the exception being InverseLogos,
depending upon user settings [21]). An AKL protein
expert can now interpret whether the Gram negative
homologs containing 8-asn are interesting natural
sequence variation exceptions sharing a residue with 18
Gram positive homologs (Figure 2C) or are just sequence
errors resulting from experimental problems during data
collection [30]. The interactive JProfileGrid user interface
made it trivial to identify the two asparagine-containing
sequences since the user can select any cell in the ProfileGrid
window and then perform a query of the MSA. Thus,
the ProfileGrid paradigm enables the MSA to be treated as
a searchable database of sequence records.
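
Treating the alignment as a searchable database amounts to selecting the records that carry a given residue at a given column. The sketch below shows that idea under stated assumptions: the function name `query_cell`, the dictionary layout, and the toy sequences are hypothetical and do not reflect JProfileGrid internals.

```python
def query_cell(alignment, position, residue):
    """Return the names of sequences with `residue` at 1-based `position`."""
    return [name for name, seq in alignment.items()
            if seq[position - 1] == residue]

# Made-up records standing in for named MSA rows.
toy_msa = {"seq_a": "TGRT", "seq_b": "TNRT", "seq_c": "TGAT", "seq_d": "TNAT"}
hits = query_cell(toy_msa, position=2, residue="N")
```

In this toy case the query returns the two records with asparagine at position 2, analogous to selecting a ProfileGrid cell and retrieving the sequences behind it.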



Roca BMC Proceedings 2014, 8(Suppl 2):S6 
http://www.biomedcentral.com/1753-6561/8/S2/S6



[Figure 2 image: ProfileGrid panels (A), (B), and (C). Color threshold legend: 0, 10, 25, 50, 70, 90%.]

Figure 2 ProfileGrid visualization of the AKL alignment protein residue distributions. ProfileGrids showing the amino acid usage in the
adenylate kinase lid (AKL) domain (A) across all organisms for 1,809 sequences, (B) from Gram negative bacteria for 923 sequences, and (C) from
Gram positive bacteria for 886 sequences. The ProfileGrid panels show a representative sequence from the Gram negative (B) and Gram positive
(C) alignments while panel (A) shows a majority consensus. For each panel, the color shading is normalized to the number of sequences in each
alignment as a color ramp from white (<10%) to dark blue (>= 90% conservation). This figure was exported from the JProfileGrid software and
takes advantage of many user defined options such as representing the residue symbols (A) versus the frequency counts (B, C).



ProfileGrids can effectively visualize alignment differences 

The 2013 BioVis Redesign Contest challenged applicants to
help the AKL biologists understand the differences
between the Gram negative (Figure 1B) and Gram positive
(Figure 1C) sequences by creating alternatives to the
Sequence Logo representation. Visualizing the protein
sequence differences between the two subpopulations
would presumably allow the biologists to propose structure
and function hypotheses about the AKL protein activities.
The original figure designed by the AKL biologists is of
three stacked Sequence Logos (Figure 1). Drawing comparisons
between protein Sequence Logos is very difficult, especially
since there is no visual encoding for differences
(although attempts have been made to introduce such a feature
[11,17,18]). In particular, a user will be frustrated performing
mental comparisons on adjacent Sequence Logos
such as when comparing the Gram negative (Figure 1B)
and Gram positive (Figure 1C) sequences. Such challenges
were documented in a recent Sequence Logo user study
[28]. By contrast, the interactive JProfileGrid software is
especially suited for comparing sequence populations to
one another.

ProfileGrids can effectively highlight differences between 
the AKL data sets (Figure 3). For this purpose, we calcu- 
lated a consensus sequence for each subpopulation and 
list them both at the top of each subpopulation ProfileGrid 
(Figure 3B & 3C) where the first row is the "reference" and 
the second row is the "highlight" sequence to be com- 
pared. For each column, if the highlight sequence differs 
from the reference, then the residue cell is marked with a 
pink border automatically generated by the JProfileGrid 
software [30]. Viewing the "highlight boxes" within the 
context of the entire reference subpopulation alignment 
clearly shows the differences between the Gram negatives 
(Figure 3B) and Gram positives (Figure 3C) since the high- 
light consensus acts as a proxy for its entire alignment. For 
the sake of visual clarity, the residue frequency shading is 
grayscale (instead of the default blue ramp) and the values 
are not shown in the cells. 
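
The consensus comparison described above can be sketched in two steps: compute a majority consensus per subpopulation, then list the columns where the highlight consensus differs from the reference. The function names and toy alignments below are hypothetical, and ties between equally frequent residues are not treated specially in this sketch.

```python
from collections import Counter

def consensus(alignment):
    """Majority residue per column of a list of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

def highlight_positions(reference, highlight):
    """1-based positions where the highlight sequence differs from the reference."""
    return [i for i, (r, h) in enumerate(zip(reference, highlight), start=1)
            if r != h]

# Made-up subpopulations, not the AKL data.
group_one = ["GRT", "GRT", "GAT"]
group_two = ["GAT", "GAS", "GRT"]
reference = consensus(group_one)
highlight = consensus(group_two)
differences = highlight_positions(reference, highlight)
```

Each reported position corresponds to a cell that would receive a highlight border in the reference subpopulation's grid.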

The ProfileGrid paradigm can allow color coding to 
visually represent amino acid classes as used in Sequence 
Logos. However, in this case the coloring is minimized to 
provide just enough information rather than dominate the 
figure and confuse the user. Figure 3A shows a color 
scheme to represent hydrophobic (brown), basic (blue), 
nitrogen-containing (lavender), acidic (red), and other 
(green) residues. Note that brown is used instead of black 
(as in the default AKL Sequence Logo coloring of Figure 1) 
to avoid visually clashing with the frequency grayscale 
encoding. Thus, as a reader scans across the ProfileGrid, 
the row position will encode the amino acid class in a man- 
ner complementary to the frequency information. 

The ProfileGrid highlight (pink cell border) feature takes
advantage of "visual popout" [27] to allow the user to
make observations. AKL alignment position 9 is a good
case study to examine differences between the Gram negative
and Gram positive subpopulations. In Figure 3B, arginine
(positively charged, coded "blue," middle rows as per
Figure 3A) is very well conserved (black cell; row 13, column
labelled "9") in Gram negatives. However, an alanine
(Figure 3B; pink borders; row 3, column labelled "9") is
found in Gram positives and this residue is small and
hydrophobic (brown, top rows). Intriguingly, examination
of the Gram positive sequences (Figure 3C) shows that
there are at least two predominant subpopulations
observed (see Figure 2C for the complete residue distribution
at position 9): the aforementioned alanine (gray box;
row 3, column 9) but also the arginine (gray box with pink
borders; row 13, column 9) shared with the Gram negatives.
Perhaps the AKL experts can determine if there is
any biological significance to these qualitative observations,
but this shows that the ProfileGrid paradigm allows
these observations to be made at all.

JProfileGrid program features 

The interactive JProfileGrid program viewer has features
designed for the biologist user [30,31]. In software version
2.0, a new "overview" mode allows the visualization of the
entire MSA within one window as either a ProfileGrid or as
stacked sequences. Individual ProfileGrid cells can be
selected to extract sequence subsets of interest during a
visualization dissection. Sorting the residue rows by
physical/chemical properties such as flexibility, helix propensity,
hydropathy, and volume allows qualitative structural analyses
to be performed. The detailed ProfileGrid window with the
symbol counts has a new second pane to view different
parts of the MSA at the same time. The "highlight" feature
can identify residues that occur at greater or less than a
user-defined threshold of residue frequency. Large alignments
can be separated into subsets of interest by using metadata
filtering once JProfileGrid imports simple sequence annotations
from flat file spreadsheet databases. The interactive
features of the JProfileGrid program can be more easily
appreciated by a live walk-through as demonstrated by my
2013 BioVis Data Contest movie submission [34].
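
Sorting the residue rows by a physical/chemical property can be sketched as an ordinary keyed sort. The example below uses the Kyte-Doolittle hydropathy index as an assumed scale; the text does not state which scale JProfileGrid uses, and the function name `rows_by_hydropathy` is hypothetical.

```python
# Kyte-Doolittle hydropathy index (assumed scale; higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def rows_by_hydropathy(gap_symbol="-"):
    """Residue row order from most to least hydrophobic, gap row last."""
    ordered = sorted(KYTE_DOOLITTLE, key=KYTE_DOOLITTLE.get, reverse=True)
    return ordered + [gap_symbol]

row_order = rows_by_hydropathy()
```

With rows ordered this way, hydrophobic residues cluster at the top of the grid, so a column dominated by one physical class shows up as a vertical band of shading.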

ProfileGrid advantages over Sequence Logos/Bundles 

ProfileGrids avoid the limitations of Sequence Logos 
especially for protein MSAs. Namely, all ProfileGrid resi- 
due symbols are legible so that no information is lost for 
bioinformatic analysis. ProfileGrids display gap symbols 
as well as the regions of a MSA where residues are not 
observed. Neither case is handled by Sequence Logos. 
The ProfileGrid paradigm, in the opinion of this author, 
clearly and effectively separates visual encoding channels 
to be more aesthetically pleasing than Sequence Logos. 
Such clarity allows for the careful dissection of sequence 
conservation patterns in MSAs by molecular biologists. 
Importantly, ProfileGrids solve the visualization problem 
of handling very large alignments since there is almost 
no limit to the number of sequences that can be repre- 
sented. The matrix representing the MSA is only 21 rows 
rather than the inefficient stacked sequence representa- 
tion that lists every individual sequence as a row. While a 
Sequence Logo can also represent very large MSAs, the
details of the underlying sequence information are lost as
in most consensus paradigms. The interactive JProfileGrid
viewer allows the user to retain access to the
sequence data for a robust protein family analysis.

Figure 3 ProfileGrids highlighting differences between AKL alignment subpopulations. The JProfileGrid software offers many user defined
options to customize the appearance of the ProfileGrid for final figure export here shown for the AKL alignment across all organisms for 1,809
sequences (A), from Gram negative bacteria for 923 sequences (B), and from Gram positive bacteria for 886 sequences (C). The ProfileGrid panel
first row shows the consensus sequence from the Gram negative (B) and Gram positive (C) alignments while panel (A) shows the majority
residues from the entire MSA. The pink boxes in panels (B) and (C) highlight differences between the two consensus sequences shown. For each
panel, the color shading is normalized to the number of sequences in each alignment as a grayscale ramp from white (<10%) to black (>= 90%
conservation). For clarity, the ProfileGrid cells in panels (B) and (C) do not show any values or symbols to facilitate visual inspection of the
patterns. The residue codes in the panel (A) second column are color shaded to represent the following amino acid classes: hydrophobic
(brown), basic (blue), nitrogen-containing (lavender), acidic (red), and other (green). The ProfileGrid rows are sorted according to the amino acid
classes whereas in Figure 2 the rows are sorted alphabetically by residue symbol.

Finally, we comment on the Redesign Contest appli- 
cants. Most of the submissions were derivatives of 
Sequence Logos and so would have the same aforemen- 
tioned limitations. Interestingly, the two entries that 
received Honorable Mentions [35,36] have replaced the 
Sequence Logo representation with a different visualiza- 
tion paradigm that one entrant named "Sequence Bun- 
dles." We note that by representing the MSA as a matrix 
with residue symbols as rows and the alignment positions 
as columns, they are handling the underlying data as a 
profile for bioinformatic sequence similarity searches [37] 
using command-line programs with no graphical user 
interfaces. The 2008 ProfileGrid publication, to the best 
of my knowledge, is the first demonstration of profiles as 
a visual paradigm for analyzing MSAs. How Sequence 
Bundles differ significantly from ProfileGrids is in the 
former's use of lines to connect matrix cells along the 
path of an individual protein sequence. More sequences 
occurring in an alignment result in lines stacking on the 
matrix visualization allowing consensus sequences to 
emerge as a thick bundle. 

While the Sequence Bundle connecting lines are aesthe- 
tically interesting, they are also a liability since as the 
MSAs get very large, the visualization will become more 
cluttered. In my opinion, what is relevant to the biologist 
user within the context of an MSA is the frequency of the 
residues in the visualization. After an observation has been 
made, then the user will dig deeper to identify the particu- 
lar sequences of interest. The interactive JProfileGrid pro- 
gram was designed for such a task. Individual sequences 
are shown alongside the top row of the ProfileGrid by 
choosing the desired reference and highlight sequences 
from JProfileGrid menus. Selecting individual ProfileGrid
cells allows a separate JProfileGrid window to display the
subset of the sequences containing that residue in the 
MSA. There is no need for connecting lines to be layered 
on top of the matrix as in Sequence Bundles. 

Conclusions 

While Sequence Logos have their merits for visualizing 
short conserved motifs especially in nucleic acid 
sequences, this paradigm has limitations when repre- 
senting protein alignments. The ProfileGrid paradigm 
replaces Sequence Logos and solves the challenge of 
visualizing large protein alignments. Thus, ProfileGrids 
allow a molecular biologist to clearly and effectively ana- 
lyze protein structure and function. 